Pb-Hash: Partitioned b-bit Hashing
Many hashing algorithms including minwise hashing (MinHash), one permutation
hashing (OPH), and consistent weighted sampling (CWS) generate integers of $B$
bits. With $k$ hashes for each data vector, the storage would be $B \times k$
bits; and when used for large-scale learning, the model size would be
$2^B \times k$, which can be expensive. A standard strategy is to use only the
lowest $b$ bits out of the $B$ bits and somewhat increase $k$, the number of
hashes. In this study, we propose to re-use the hashes by partitioning the $B$
bits into $m$ chunks, e.g., $b \times m = B$. Correspondingly, the model size
becomes $m \times 2^b \times k$, which can be substantially smaller than the
original $2^B \times k$.
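To make the partitioning step concrete, here is a minimal Python sketch (our own, not from the paper; the function `pb_hash_chunks` and its default parameters are hypothetical) that splits each $B$-bit hash into $m$ chunks of $b = B/m$ bits, so each chunk indexes a table of size $2^b$ instead of $2^B$:

```python
import numpy as np

def pb_hash_chunks(hashes, B=16, m=4):
    """Split each B-bit hash integer into m chunks of b = B/m bits.

    hashes: integer array of shape (n, k), k B-bit hashes per data vector.
    Returns an array of shape (n, k, m); each entry is a b-bit integer that
    indexes a table of size 2^b instead of 2^B, so the model size becomes
    m x 2^b x k rather than 2^B x k.
    """
    b = B // m                      # bits per chunk
    mask = (1 << b) - 1
    shifts = np.arange(m) * b       # chunk j keeps bits [j*b, (j+1)*b)
    return (hashes[..., None] >> shifts) & mask

# Example: one data vector with k = 3 hashes of B = 16 bits, m = 4 chunks.
h = np.array([[0xBEEF, 0x1234, 0xF0F0]])
print(pb_hash_chunks(h))            # all values lie in [0, 2^4 - 1]
```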
Our theoretical analysis reveals that by partitioning the hash values into $m$
chunks, the accuracy would drop. In other words, using $m$ chunks of $B/m$ bits
would not be as accurate as directly using $B$ bits. This is due to the
correlation from re-using the same hash. On the other hand, our analysis also
shows that the accuracy would not drop much for small $m$ (e.g., $m \leq 4$). In
some regions, Pb-Hash still works well even for $m$ much larger than 4. We expect
Pb-Hash would be a good addition to the family of hashing methods/applications
and benefit industrial practitioners.
We verify the effectiveness of Pb-Hash in machine learning tasks, for linear
SVM models as well as deep learning models. Since the hashed data are
essentially categorical (ID) features, we follow the standard practice of using
embedding tables for each hash. With Pb-Hash, we need to design an effective
strategy to combine the $m$ embeddings. Our study provides an empirical
evaluation of four pooling schemes: concatenation, max pooling, mean pooling,
and product pooling. There is no definitive answer as to which pooling scheme
is always better, and we leave that for future study.
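For illustration, here is a minimal sketch (our own straightforward reading of the four scheme names; `pool_embeddings` is a hypothetical helper, not the paper's code) of how the $m$ chunk embeddings could be combined:

```python
import numpy as np

def pool_embeddings(E, scheme="concat"):
    """Combine the m chunk embeddings of one hash into a single vector.

    E: array of shape (m, d), one d-dimensional embedding per chunk.
    """
    if scheme == "concat":          # shape (m * d,) -- keeps all information
        return E.reshape(-1)
    if scheme == "max":             # shape (d,), element-wise max over chunks
        return E.max(axis=0)
    if scheme == "mean":            # shape (d,), element-wise mean over chunks
        return E.mean(axis=0)
    if scheme == "product":         # shape (d,), element-wise product over chunks
        return E.prod(axis=0)
    raise ValueError(f"unknown pooling scheme: {scheme}")

rng = np.random.default_rng(0)
E = rng.normal(size=(4, 8))         # m = 4 chunks, d = 8 embedding dims
vec = pool_embeddings(E, "mean")    # fed into the downstream SVM / deep model
```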
Constrained Approximate Similarity Search on Proximity Graph
Search engines and recommendation systems are built to efficiently display
relevant information selected from massive pools of candidates. Typically a
three-stage mechanism is employed in those systems: (i) a small collection of
items is first retrieved by (e.g.,) approximate near neighbor search
algorithms; (ii) then a collection of constraints is applied to the retrieved
items; (iii) a fine-grained ranking neural network is employed to determine the
final recommendation. We observe a major defect of the original three-stage
pipeline: Although we only target to retrieve $k$ vectors in the final
recommendation, we have to preset a sufficiently large $K$ ($K \gg k$) for each
query, and ``hope'' the number of surviving vectors after the filtering is not
smaller than $k$. That is, at least $k$ vectors in the $K$ similar candidates
satisfy the query constraints.
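For concreteness, a minimal sketch (ours; `search_fn` and `constraint` are hypothetical placeholders, not the paper's API) of this baseline retrieve-then-filter stage and its failure mode:

```python
def retrieve_then_filter(search_fn, query, constraint, k, K):
    """Baseline stages (i)-(ii): fetch K approximate neighbors, then drop
    the ones violating the constraint. If fewer than k survive, the whole
    search must be redone with an even larger K.

    search_fn(query, K) -> list of candidate ids (stage i, ANN search);
    constraint(item) -> bool, the user-specified filter (stage ii).
    """
    candidates = search_fn(query, K)                      # K >> k, preset blindly
    survivors = [c for c in candidates if constraint(c)]
    if len(survivors) < k:
        raise RuntimeError("preset K was too small; retry with a larger K")
    return survivors[:k]    # handed to the ranking network (stage iii)
```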
In this paper, we investigate this constrained similarity search problem and
attempt to merge the similarity search stage and the filtering stage into one
single search operation. We introduce AIRSHIP, a system that integrates
user-defined filtering functions into the similarity search framework. The
proposed system neither builds extra indices nor requires prior
knowledge of the query constraints. We propose three optimization strategies:
(1) starting point selection, (2) multi-direction search, and (3) biased
priority queue selection. Experimental evaluations on both synthetic and real
data confirm the effectiveness of the proposed AIRSHIP algorithm. We focus on
constrained graph-based approximate near neighbor (ANN) search in this study,
in part because graph-based ANN is known to achieve excellent performance. We
believe it is also possible to develop constrained hashing-based ANN or
constrained quantization-based ANN.
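To illustrate the merged search-and-filter idea, here is a generic sketch (ours, not AIRSHIP's actual algorithm; it omits the three optimizations above and all graph-construction details) of a constrained best-first search on a proximity graph:

```python
import heapq
import numpy as np

def constrained_greedy_search(graph, vectors, query, constraint, k, start=0):
    """Best-first traversal of a proximity graph that checks the
    user-defined constraint during the search itself, so no separate
    post-filtering stage (and no oversized preset K) is needed.

    graph: dict mapping node id -> list of neighbor ids;
    vectors: array of shape (n, d); constraint: node id -> bool.
    """
    visited = {start}
    frontier = [(np.linalg.norm(vectors[start] - query), start)]  # min-heap
    results = []  # max-heap via negated distances; holds <= k survivors
    while frontier:
        dist, node = heapq.heappop(frontier)
        if len(results) >= k and dist > -results[0][0]:
            break  # frontier can no longer improve the current top-k
        if constraint(node):
            heapq.heappush(results, (-dist, node))
            if len(results) > k:
                heapq.heappop(results)
        # Nodes failing the constraint are still expanded: they may
        # lead to satisfying regions elsewhere in the graph.
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(frontier,
                               (np.linalg.norm(vectors[nb] - query), nb))
    return [n for _, n in sorted((-d, n) for d, n in results)]
```

The key design point this sketch captures is that filtering happens inside the traversal, while non-satisfying nodes still serve as stepping stones, so the search never has to over-fetch and re-filter.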